A Clustering Strategy for the Key Segmentation of Musical Audio
Authors: Kaliakatsos-Papakostas et al.
Abstract
Key changes are common in Western classical music. The precise segmentation of a music piece at instances where key changes occur allows for further analysis, like self-similarity analysis, chord recognition, and several other techniques that mainly pertain to the characterization of music content. This article examines the automatic segmentation of audio data into parts composed in different keys, using clustering on chroma-related spaces. To this end, the k-means algorithm is used and a methodology is proposed so that useful information about key changes can be derived, regardless of the number of clusters or key changes. The proposed methodology is evaluated by experimenting on the segmentation of recordings of existing compositions from the Classic-Romantic repertoire. Additional analysis is performed using artificial data sets. Specifically, the construction of artificial pieces is proposed as a means to investigate the potential of the strategy under discussion in predefined key-change scenarios that encompass different musical characteristics. For the existing compositions, we compare the results of our proposed methodology with others from the music information retrieval literature. Finally, although the proposed methodology is only capable of locating key changes and not the key identities themselves, we discuss results regarding the labeling of a composition’s key in the located segments.
The notion of tonality is fundamental in Western music. Most aspects of tonal analysis are based on the relations between pitches, provided a context: the composition’s key. The key specifies a set of notes (a seven-note subset of the twelve notes of the chromatic scale) that are perceived as being related, although the utilization of key differs according to musical style and historical period, among other factors. Western classical music typically changes key, or modulates (in the broadest sense), during the course of the piece.
Content segmentation and characterization of such music are aided by identifying the composition’s main key and the related keys into which it is likely to modulate. Provided an accurate segmentation of a piece at the locations where key changes occur, several other tasks can be performed more accurately, such as self-similarity analysis (Chai 2005) and chord recognition (Lee and Slaney 2007), among others. Additionally, the availability of an enormous number of digital music recordings makes musical content analysis an important tool for automatically categorizing large data sets. Towards this aim, the detection of points where key changes occur can help define the local characteristics of pieces, providing a basis for further semantic analysis.
Computer Music Journal, 37:1, pp. 52–69, Spring 2013. doi:10.1162/COMJ_a_00168. © 2013 Massachusetts Institute of Technology.
There are two main branches of the field of music information retrieval: research involving symbolic music representations such as MIDI data, and research involving nonsymbolic data, namely, audio. This article examines the automatic segmentation of audio data, although some concepts can apply equally to segmentation of symbolic data. Specifically, the focus here is segmentation of a piece into parts composed in different keys, through clustering on chroma-related spaces. (The concept of chroma, also known as pitch class, identifies a pitch in the Western equal-tempered tuning, disregarding the pitch’s register, i.e., the octave in which the pitch occurs.) Although the proposed methodology is capable only of key segmentation and not labeling (i.e., it locates key changes but not key identities), we also present results of labeling via simple template-matching techniques.
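To illustrate what template-based key labeling can look like, the following minimal sketch correlates a twelve-dimensional chroma vector with the 24 rotations of the Krumhansl-Kessler major and minor profiles (Krumhansl 1990). It is an assumption that this resembles the template matching mentioned above, since the text does not specify the technique; the names `pearson` and `label_key` are hypothetical.

```python
from math import sqrt

# Krumhansl-Kessler probe-tone profiles (Krumhansl 1990); index 0 is the tonic.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def label_key(chroma):
    """Return the best-correlating key name for a 12-value chroma vector (index 0 = C)."""
    best_score, best_name = None, None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so that its tonic weight sits at pitch class `tonic`.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            score = pearson(chroma, rotated)
            if best_score is None or score > best_score:
                best_score, best_name = score, NAMES[tonic] + " " + mode
    return best_name
```

In this sketch the chroma vector would be an average over the frames of one located segment, so labeling happens only after segmentation.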
The k-means algorithm (Hartigan and Wong 1979) is used and a methodology is proposed so that useful information about key changes can be derived, regardless of the number of clusters or the number of key changes. The proposed strategy relies solely on geometric properties of the chroma space and does not need training, avoiding the potential hazard of being ineffective on musical styles different from the ones it has been trained on.
Experimental results are reported on segmentation of (1) recordings of real compositions, together with a comparison between our proposed methodology and previous ones, and (2) artificial music data sets, the construction of which is described subsequently. The construction of these artificial pieces is intended to provide an additional tool for examining the behavior of the proposed approach under “laboratory” conditions, allowing the analysis of the model’s capabilities using large data sets of pieces with predefined structure. With these artificial data sets, multiple key-change scenarios were included in order to perform an exhaustive efficiency assessment of the clustering strategy. Finally, key labeling of the segmented parts was applied to the data set of real compositions, with the goal of providing a robust and accurate framework for musical content characterization.
Previous Work, Motivation, and Aims
Several approaches have provided significant insights into the problem of automatic detection of key changes in audio. Some work has extended key detection to the detection of local keys, i.e., areas within a piece that are composed using different keys. These approaches can be divided into two main categories. The first uses a priori information about the expected chroma constitution of keys, either in the form of key templates (Krumhansl 1990; Temperley 2004, 2006), or trained/tuned hidden Markov models (HMM) (Chai and Vercoe 2005; Noland and Sandler 2006; Papadopoulos and Peeters 2009, 2012).
The second category includes methods that explore geometric properties of the pitch space to detect key changes without prior information about key templates or expected key changes (Chuan and Chew 2007; Izmirli 2007; Chew 2002). Similar techniques have also been proposed for harmonic segmentation, i.e., dividing a piece of music into a sequence of distinct chords (Harte, Sandler, and Gasser 2006), rather than the key segmentation discussed in this work. Finally, a weighted graph approach was also tested for simultaneous chord and key recognition (Rocher et al. 2010). Here a larger data set of 174 pieces was used, but with a small mean number of key changes (1.69) per piece. This last work, however, does not report on segmentation accuracy results.
All works related to localized detection of key changes utilize the chroma information of a piece and look for contiguous chroma segments that could belong to a single key. These chroma segments are expressed as chroma vectors, which incorporate information about the presence and intensity of the twelve chroma within short segments of a piece. The HMM-related approaches define the tonal constitution of each chroma vector (in terms of HMM, the emission) by associating its probability of belonging to a certain key with a transition probability from the key of the previous vector. Additional information of higher musical structure (i.e., chords) has also been utilized to refine key and key transition probabilities (Papadopoulos and Peeters 2009, 2012).
The work described in Izmirli (2007) utilizes nonnegative matrix factorization (NMF) on the chroma matrix V of a piece to produce a set of patterns W and an activation matrix H, so that V = WH. The pattern matrix encapsulates information related to the identity of all keys in V, and the activation of each pattern, shown in H, reflects the location of each key. A limitation of this methodology is that the number of keys in the piece needs to be known in advance, so that each pattern of W corresponds to a key. The advantage of our methodology, regarding segmentation, is that there is no necessity for the number of keys to be known in advance. An additional advantage of our approach is that when it uses NMF and principal component analysis (PCA), the number of projecting dimensions does not have a crucial meaning (i.e., does not reflect the number of keys), but serves entirely as a dimension-reduction mechanism to facilitate clustering.
Besides the key segmentation per se, an additional motivation of the work presented here is to emphasize the potential of our methodology from a musical perspective. Such a task should not be performed on individual music pieces, however, because this would restrict the scope of the produced results to the analysis of these music works. To tackle this problem, we propose the construction of data sets of “artificial” music, which incorporate the desired pre-specified musical structure. This approach enables us to produce a large number of different test cases with diverse musical characteristics, thereby allowing a deeper analysis of the effectiveness of a model under different tonal conditions.
The Proposed Strategy
In the proposed approach we examine the detection of key changes in musical audio content using clustering. Clustering is a way to separate a collection of objects into groups, such that objects belonging to the same group are more similar than objects of different groups. This technique has been used in a wide range of applications (Jain, Murty, and Flynn 1999), with the notion of object similarity depending on the specific problem. In the proposed application, similarity is measured in the chromatic tonal domain, calculated for short musical segments. This section analyzes the proposed clustering methodology for key segmentation of music recordings.
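The general idea of clustering chroma frames can be sketched as follows. This is a minimal illustration only: it uses plain Lloyd iterations rather than the Hartigan and Wong variant cited earlier, it seeds the centers deterministically from frames evenly spaced in time, and it simply reports the frame indices where consecutive frames fall into different clusters. How the article actually derives key-change information from the clusters is not described in this excerpt, so `kmeans` and `change_points` are hypothetical names for an assumed procedure.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(frames, k, iters=100):
    """Cluster a list of chroma frames (one 12-value vector per frame); return labels."""
    n = len(frames)
    # Assumed initialization: k frames evenly spaced over the piece.
    centers = [list(frames[(i * (n - 1)) // max(k - 1, 1)]) for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each frame goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(f, centers[c])) for f in frames
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its member frames.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def change_points(labels):
    """Frame indices at which the cluster label differs from the previous frame."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```

With 0.1 sec frames, a change point at frame index 30 would correspond to a candidate boundary 3 sec into the recording.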
The presentation includes a parallel demonstration of the tasks described, based on Dvorak’s Humoresque No. 7, which is composed in two keys: G-flat major and F-sharp minor.
Feature Representation
In our approach, we use the chroma energy normalized statistics (CENS) (Müller, Kurth, and Clausen 2005), obtained using the Chroma Toolbox (Müller 2010; Müller and Ewert 2011) for MATLAB. The sampling rate of the pieces that were used for this research was 44,100 Hz. The methodology of the Chroma Toolbox, however, uses down-sampled versions of the signals in order to achieve greater resolution accuracy at lower frequencies (Müller and Ewert 2011). Thus, with the utilization of a constant-Q multirate filter centered at the frequency of each pitch, the chroma profile is evaluated within frames of 0.1 sec. The CENS representation is a statistically smoothed transformation of this chroma profile, achieved through quantization and component-wise convolution of the chroma profile of each frame with its neighbors, using a Hanning window. The window size was selected to be w = 45 frames (4.5 sec) in order to have large-scale tracking of the chroma activity, avoiding potential misclassification of frames caused by articulations or chromatic passages. It could be argued that the window size should be relative to the tempo of the piece rather than a fixed time unit. This would be worth examining in future research.
In a mathematical sense, the CENS representation transforms the recorded piece into a real matrix $C \in \mathbb{R}^{12 \times F}$, with twelve rows (one for each chroma) and F columns, with F being the number of time frames. Our goal is to construct an algorithm that uses C to detect the time position of key changes through clustering all frames, not only in the twelve-dimensional tonal space of C, but also in spaces of reduced dimension derived using PCA and NMF.
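The smoothing step that produces C can be sketched as below. This is not the Chroma Toolbox implementation: it assumes a chroma matrix is already available as twelve rows of per-frame values, and it performs only the component-wise Hann-window convolution (w = 45 frames) followed by per-frame normalization to unit Euclidean length. The quantization and down-sampling stages of CENS are omitted, and `hann` and `smooth_and_normalize` are hypothetical names.

```python
from math import cos, pi, sqrt


def hann(w):
    """Symmetric Hann window of length w (w >= 2)."""
    return [0.5 - 0.5 * cos(2 * pi * n / (w - 1)) for n in range(w)]


def smooth_and_normalize(C, w=45):
    """Smooth each chroma row of C over time, then scale each frame to unit length.

    C is a list of rows (one per chroma), each holding F frame values.
    Frames outside the recording are treated as zeros.
    """
    win = hann(w)
    half = w // 2
    F = len(C[0])
    out = []
    for row in C:
        smoothed = []
        for t in range(F):
            acc = 0.0
            for j, g in enumerate(win):
                s = t + j - half  # neighbouring frame covered by window tap j
                if 0 <= s < F:
                    acc += g * row[s]
            smoothed.append(acc)
        out.append(smoothed)
    # Normalize every frame (column) so that its Euclidean norm is 1.
    for t in range(F):
        norm = sqrt(sum(out[p][t] ** 2 for p in range(len(out))))
        if norm > 0:
            for p in range(len(out)):
                out[p][t] /= norm
    return out
```

At 0.1 sec per frame, the default window of 45 frames spans the 4.5 sec stated above; making `w` depend on tempo would be a one-argument change in this sketch.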
Dimension Reduction in the Chroma Space
Before presenting the proposed approach, we briefly but rigorously describe the PCA and NMF dimension reduction techniques and the parts of them that are associated with clustering. For the PCA, we obtain the covariance matrix S of C_c, which is the row-centered version of the matrix C (a matrix is centralized per row if we subtract the mean ...
[Figure 1 panels: PCA (1st principal component vs. 2nd principal component) and NMF (1st NMF dimension vs. 2nd NMF dimension).]
Figure 1. Two-dimensional projections of each frame ...
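The PCA step named above (covariance matrix S of the row-centered matrix, followed by projection of every frame) can be sketched as follows. The eigenvectors are found here by power iteration with deflation, which is an assumption made to keep the example dependency-free; the source does not say how the decomposition is computed, and all function names are hypothetical.

```python
from math import sqrt


def center_rows(C):
    """Subtract from every row of C its own mean over all frames."""
    centered = []
    for row in C:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    return centered


def covariance(Cc):
    """Covariance matrix S (rows x rows) of a row-centered matrix Cc."""
    F = len(Cc[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / (F - 1) for rj in Cc] for ri in Cc]


def leading_eigenvectors(S, d=2, iters=200):
    """First d principal directions of a symmetric matrix S (power iteration + deflation)."""
    n = len(S)
    S = [row[:] for row in S]  # work on a copy, since deflation modifies it
    start_norm = sqrt(sum((i + 1) ** 2 for i in range(n)))
    components = []
    for _ in range(d):
        v = [(i + 1) / start_norm for i in range(n)]  # fixed, non-uniform start vector
        for _step in range(iters):
            w = [sum(s * x for s, x in zip(row, v)) for row in S]
            norm = sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        Sv = [sum(s * x for s, x in zip(row, v)) for row in S]
        lam = sum(a * b for a, b in zip(v, Sv))  # Rayleigh quotient = eigenvalue
        components.append(v)
        # Deflation: remove the found direction so the next one can emerge.
        for i in range(n):
            for j in range(n):
                S[i][j] -= lam * v[i] * v[j]
    return components


def project(Cc, components):
    """Coordinates of every frame (column of Cc) along each component."""
    F = len(Cc[0])
    return [[sum(w[i] * Cc[i][t] for i in range(len(w))) for t in range(F)]
            for w in components]
```

Projecting onto two components gives one two-dimensional point per frame, which is the kind of reduced space in which the frames can then be clustered.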
Similar resources
Detection of lung cancer using CT images based on novel PSO clustering
Lung cancer is one of the most dangerous diseases that cause a large number of deaths. Early detection and analysis can be very helpful for successful treatment. Image segmentation plays a key role in the early detection and diagnosis of lung cancer. K-means algorithm and classic PSO clustering are the most common methods for segmentation that have poor outputs. In t...
Impact of audio segmentation and segment clustering on automated transcription accuracy of large spoken archives
This paper addresses the influence of audio segmentation and segment clustering on automatic transcription accuracy for large spoken archives. The work forms part of the ongoing MALACH project, which is developing advanced techniques for supporting access to the world’s largest digital archive of video oral histories collected in many languages from over 52000 survivors and witnesses of the Hol...
An unsupervised system for the synthesis of variations from audio percussion patterns
A system is introduced that learns the structure of an audio recording of a rhythmical percussion fragment in an unsupervised manner and synthesizes musical variations from it. The procedure consists of 1) segmentation, 2) symbolization (feature extraction, clustering, sequence structure analysis, temporal alignment), and 3) synthesis. The symbolization step yields a sequence of event classes. ...
High Performance Implementation of Fuzzy C-Means and Watershed Algorithms for MRI Segmentation
Image segmentation is one of the most common steps in digital image processing. There are many image segmentation algorithms (e.g., thresholding, edge detection, and region growing) employed for classifying a digital image into different segments. In this connection, finding a suitable algorithm for medical image segmentation is a challenging task due to mainly the noise, low contrast, and steep...
Audio Indexing Using Speaker Identification
In this paper, a technique for audio indexing based on speaker identification is proposed. When speakers are known a priori, a speaker index can be created in real time using the Viterbi algorithm to segment the audio into intervals from a single talker. Segmentation is performed using a hidden Markov model network consisting of interconnected speaker sub-networks. Speaker training data is used ...
Journal title: Computer Music Journal
Volume: 37
Publication date: 2013